1 Introduction

This document contains replication material for the paper Diagnosing Gender Bias in Image Recognition Systems. It uses several datasets, located in the data folder of this repository:

  • mc_data.tsv: includes socio-demographic data for Members of the 115th US Congress (MCs) and annotations of image recognition platforms for their professional photographs.
  • twitter_data.tsv: includes data on images from Tweets posted by MCs.
  • crowdvalidation_data.tsv: includes results from a crowd-sourced validation sample of the images tweeted by MCs.
  • imagelabels_cat_data.tsv: includes manually coded categories of image annotations of MC photographs.

All datasets are tab separated and encoded in UTF-8. In addition, the replication material includes photographs of MCs (which fall under the public domain) in the folder mc_images.
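For reference, the datasets can be loaded with readr (part of the tidyverse). A minimal sketch, demonstrated on a small stand-in file (in the replication itself the path would be, e.g., data/mc_data.tsv):

```r
library(readr)

# the replication files are tab separated and UTF-8 encoded, so
# read_tsv() with an explicit locale reads them directly; the file
# created here only stands in for the real data/mc_data.tsv
tsv_path <- file.path(tempdir(), 'mc_data.tsv')
writeLines(c('name\tgender\tage', 'Maria Cantwell\tFemale\t60'), tsv_path)

mc_df <- read_tsv(tsv_path, locale = locale(encoding = 'UTF-8'))
dim(mc_df)
```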

Please note that due to Twitter's Terms of Service, we are not allowed to include raw tweets or images from tweets in the replication material. In addition, image recognition platforms regularly change their inner workings (often silently). For these reasons, we also did not include any code for data extraction and data annotation using the image recognition tools. Instead, we directly added the annotations obtained from the image recognition platforms to the datasets. If desired, tweets and images can nevertheless be collected using the tweet IDs we included in the data, for instance with the R package rtweet.

Furthermore, if you are interested in working with image recognition software, Carsten Schwemmer created the R package imgrec to accompany the paper. As outlined in the package vignette, imgrec can be used to easily retrieve and analyze image annotations from Google Cloud Vision. A short tutorial for hydrating tweets and downloading corresponding images is also available here.

2 Software Setup

We use the following R packages and versions throughout the document:

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
## [3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] scales_1.1.1      viridis_0.5.1     viridisLite_0.3.0 AER_1.2-9        
##  [5] survival_2.44-1.1 sandwich_2.5-1    lmtest_0.9-38     zoo_1.8-8        
##  [9] car_3.0-9         quanteda_2.1.2    tidytext_0.2.6    lubridate_1.7.9  
## [13] stargazer_5.2.2   hrbrthemes_0.8.0  effects_4.2-0     carData_3.0-4    
## [17] ggeffects_0.16.0  cowplot_1.1.0     ggrepel_0.8.2     magick_2.4.0     
## [21] sjPlot_2.8.5      extrafont_0.17    forcats_0.5.0     stringr_1.4.0    
## [25] dplyr_1.0.2       purrr_0.3.4       readr_1.3.1       tidyr_1.1.2      
## [29] tibble_3.0.3      ggplot2_3.3.2     tidyverse_1.3.0   MASS_7.3-51.4    
## 
## loaded via a namespace (and not attached):
##  [1] minqa_1.2.4        colorspace_1.4-1   rio_0.5.16         ellipsis_0.3.1    
##  [5] sjlabelled_1.1.7   estimability_1.3   parameters_0.8.6   fs_1.5.0          
##  [9] rstudioapi_0.11    SnowballC_0.7.0    fansi_0.4.1        mvtnorm_1.1-1     
## [13] xml2_1.3.2         splines_3.6.1      knitr_1.30         sjmisc_2.8.5      
## [17] Formula_1.2-3      jsonlite_1.7.1     nloptr_1.2.2.2     broom_0.7.0       
## [21] Rttf2pt1_1.3.8     dbplyr_1.4.4       effectsize_0.3.3   compiler_3.6.1    
## [25] httr_1.4.2         sjstats_0.18.0     emmeans_1.5.1      backports_1.1.10  
## [29] assertthat_0.2.1   Matrix_1.2-17      survey_4.0         cli_2.0.2         
## [33] htmltools_0.5.0    tools_3.6.1        coda_0.19-3        gtable_0.3.0      
## [37] glue_1.4.2         fastmatch_1.1-0    Rcpp_1.0.5         cellranger_1.1.0  
## [41] vctrs_0.3.4        nlme_3.1-140       extrafontdb_1.0    insight_0.9.6     
## [45] xfun_0.17          stopwords_2.0      openxlsx_4.2.2     lme4_1.1-23       
## [49] rvest_0.3.6        lifecycle_0.2.0    statmod_1.4.34     hms_0.5.3         
## [53] curl_4.3           yaml_2.2.1         gridExtra_2.3      gdtools_0.2.2     
## [57] stringi_1.4.6      bayestestR_0.7.2   tokenizers_0.2.1   zip_2.1.1         
## [61] boot_1.3-22        cpp11_0.2.1        rlang_0.4.7        pkgconfig_2.0.3   
## [65] systemfonts_0.3.1  evaluate_0.14      lattice_0.20-38    tidyselect_1.1.0  
## [69] magrittr_1.5       R6_2.4.1           generics_0.0.2     DBI_1.1.0         
## [73] foreign_0.8-71     pillar_1.4.6       haven_2.3.1        withr_2.3.0       
## [77] abind_1.4-5        nnet_7.3-12        performance_0.5.0  janeaustenr_0.1.5 
## [81] modelr_0.1.8       crayon_1.3.4       rmarkdown_2.3      usethis_1.6.3     
## [85] grid_3.6.1         readxl_1.3.1       data.table_1.13.0  blob_1.2.1        
## [89] reprex_0.3.0       digest_0.6.25      xtable_1.8-4       RcppParallel_5.0.2
## [93] munsell_0.5.0      mitools_2.4

3 Biases in Image Labels of Professional Photographs

First, we load data of MCs and image annotations for their professional photographs which we extracted from their Wikipedia pages.

## # A tibble: 6 x 16
##   name    age gender ethnicity type  party state user_screen_name wikipedia_id
##   <chr> <dbl> <chr>  <chr>     <chr> <chr> <chr> <chr>            <chr>       
## 1 Sher~    66 Male   white     Sena~ Demo~ OH    SenSherrodBrown  Sherrod_Bro~
## 2 Mari~    60 Female white     Sena~ Demo~ WA    SenatorCantwell  Maria_Cantw~
## 3 Benj~    75 Male   white     Sena~ Demo~ MD    SenatorCardin    Ben_Cardin  
## 4 Thom~    71 Male   white     Sena~ Demo~ DE    SenatorCarper    Tom_Carper  
## 5 Robe~    58 Male   white     Sena~ Demo~ PA    SenBobCasey      Bob_Casey_J~
## 6 Bob ~    66 Male   white     Sena~ Repu~ TN    SenBobCorker     Bob_Corker  
## # ... with 7 more variables: wiki_img_url <chr>, wiki_img_labels <chr>,
## #   wiki_img_labelsconf <chr>, wiki_img_labels_macv <chr>,
## #   wiki_img_labels_awsrek <chr>, wiki_img_labelsconf_awsrek <chr>,
## #   wiki_img_objects <chr>

The following chunk provides some descriptive statistics about the composition of this dataset.

## # A tibble: 3 x 2
##   party           n
##   <chr>       <int>
## 1 Democrat      243
## 2 Independent     2
## 3 Republican    295
## # A tibble: 2 x 2
##   gender     n
##   <chr>  <int>
## 1 Female   110
## 2 Male     430
## [1] 60.42593
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   gender   age
##   <chr>  <dbl>
## 1 Female  62.2
## 2 Male    60.0

As we see above, there are 243 Democrats, 295 Republicans, and 2 Independents. There are 110 women and 430 men. The average age in this dataset is 60.4 years, with an average of 62.2 years among women and 60.0 years among men.
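These descriptives follow directly from standard dplyr verbs. A minimal sketch on a toy tibble (the variable names match the replication data; the values are made up):

```r
library(dplyr)

toy <- tibble(
  gender = c('Female', 'Female', 'Male', 'Male', 'Male'),
  age    = c(62, 64, 58, 60, 62)
)

# group sizes and mean age by gender
counts <- toy %>% count(gender)
ages   <- toy %>% group_by(gender) %>%
  summarise(age = mean(age), .groups = 'drop')

counts
ages
```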

3.1 Annotation examples

In the following, we provide examples of image annotations by Google Cloud Vision (GCV), Amazon Rekognition (AmRek) and Microsoft Azure Computer Vision (MACV).

Please note: at the time of releasing this replication document, a bug in the magick package may lead to errors when creating the following set of images. For this reason, we set eval to FALSE for these code chunks. We expect this bug to be fixed in later versions of the package, as it has already been filed as an issue on GitHub.

3.2 Image labels - key differences

Next, we quantify key differences in labels by gender for each image recognition service. We only consider labels that were assigned at least 5 times across all images.

## [1] 540  61
## [1] 540  70
## [1] 540  80

We create a function to examine group differences based on Chi² tests.
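As a minimal sketch of such a test, base R's chisq.test() can compare how often a label is assigned to photographs of women vs. men (the actual replication code relies on quanteda for these statistics; the function below and its example inputs are purely illustrative):

```r
# chi² test for whether a label occurs disproportionately often for
# one gender, based on counts of images with and without the label
label_chi2 <- function(with_f, without_f, with_m, without_m) {
  tab <- matrix(c(with_f, without_f, with_m, without_m), nrow = 2,
                dimnames = list(c('with label', 'without label'),
                                c('Female', 'Male')))
  chisq.test(tab, correct = FALSE)
}

# e.g. a label assigned to 30 of 110 women's and 40 of 430 men's photos
label_chi2(30, 80, 40, 390)$statistic
```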

3.3 Manual coding of image labels

For the remainder of this document, we will focus on the analysis of GCV output. To get a better understanding of what kinds of labels might be biased, we manually coded all labels returned by GCV for the photographs into a smaller set of categories: ‘physical trait / body’, ‘occupation’, ‘clothing / apparel’, ‘color / adjective’ and ‘other’.

## # A tibble: 6 x 2
##   label      cat               
##   <chr>      <chr>             
## 1 actor      occupation        
## 2 afro       physicaltrait_body
## 3 bangs      physicaltrait_body
## 4 beard      physicaltrait_body
## 5 beauty     physicaltrait_body
## 6 black_hair physicaltrait_body

This code creates category counts for each photograph:

##   clothing_apparel color_adjective occupation other physicaltrait_body
## 1                4               1          5     1                  0
## 2                0               0          8     0                  0
## 3                0               0          5     1                  0
## 4                3               1          5     1                  1
## 5                5               1          4     0                  1
## 6                3               1          7     0                  2
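The underlying split–join–count pattern can be sketched as follows on made-up labels (the column names follow the replication data; the values are illustrative):

```r
library(dplyr)
library(tidyr)
library(stringr)

# one row per image with its comma-separated GCV labels
imgs <- tibble(id = 1:2,
               labels = c('suit, tie, official', 'smile, suit'))

# manually coded label-to-category lookup (as in imagelabels_cat_data.tsv)
cats <- tibble(label = c('suit', 'tie', 'official', 'smile'),
               cat   = c('clothing_apparel', 'clothing_apparel',
                         'occupation', 'physicaltrait_body'))

cat_counts <- imgs %>%
  mutate(label = str_split(labels, ', ')) %>%
  unnest(label) %>%                  # one row per image-label pair
  left_join(cats, by = 'label') %>%  # attach the manual category
  count(id, cat) %>%                 # category counts per image
  pivot_wider(names_from = cat, values_from = n, values_fill = 0)
```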

3.4 Modeling Label Counts

Using negative binomial regression models, we model category counts and include covariates for gender, ethnicity, and party of MCs.

## # A tibble: 7 x 2
##   ethnicity                   n
##   <chr>                   <int>
## 1 white                     428
## 2 african_american           49
## 3 hispanic                   44
## 4 asian                      15
## 5 indian                      2
## 6 african_american, asian     1
## 7 hispanic, asian             1

Due to the skewed distribution, we only differentiate between “White” and “Non-White” ethnicity.

## 
##  Overdispersion test
## 
## data:  apparel
## z = 2.0246, p-value = 0.02146
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 0.132935
## 
##  Overdispersion test
## 
## data:  physical
## z = 5.829, p-value = 2.789e-09
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##    alpha 
## 1.265828
## 
##  Overdispersion test
## 
## data:  occupation
## z = -12.277, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##      alpha 
## -0.4980232
## 
##  Overdispersion test
## 
## data:  color_adjective
## z = -4.5435, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##      alpha 
## -0.2874472
## 
##  Overdispersion test
## 
## data:  other
## z = -11.551, p-value = 1
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##      alpha 
## -0.4511799
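The models themselves are fit with MASS::glm.nb(). A minimal sketch on simulated overdispersed counts (the covariates follow the text; the data are simulated, not the replication data):

```r
library(MASS)

set.seed(1)
n <- 540
sim <- data.frame(
  gender    = sample(c('Female', 'Male'), n, replace = TRUE),
  ethnicity = sample(c('White', 'Non-White'), n, replace = TRUE)
)
# simulate more physical-trait labels for women; size = 1 induces the
# overdispersion that motivates the negative binomial model
mu <- exp(0.5 + 1 * (sim$gender == 'Female'))
sim$physical <- rnbinom(n, size = 1, mu = mu)

nb_fit <- glm.nb(physical ~ gender + ethnicity, data = sim)
coef(nb_fit)
```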

Now we combine all models and visualize effect estimates for the different covariates, holding the respective other covariates at their observed values.

The following code creates figure 7 from the main paper and shows the average predicted category counts conditional on gender:

We repeat this procedure for ethnicity. The plot is included in the supplementary material:

The next plot shows predicted counts by party (Republicans vs Democrats) and is also included in the supplementary material:

Finally, we examine differences in predicted label counts with regard to the age of the politicians.
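The idea of averaging predictions while holding the remaining covariates at their observed values (as ggeffects does) can be sketched with predict() on a fitted model; everything below is simulated for illustration:

```r
library(MASS)

set.seed(2)
n <- 540
sim <- data.frame(
  gender = sample(c('Female', 'Male'), n, replace = TRUE),
  age    = round(runif(n, 35, 85))
)
sim$count <- rnbinom(n, size = 2, mu = exp(1 - 0.01 * sim$age))

fit <- glm.nb(count ~ gender + age, data = sim)

# average predicted count at a fixed age, with all other covariates
# kept at their observed values
avg_pred_at_age <- function(model, data, age) {
  data$age <- age
  mean(predict(model, newdata = data, type = 'response'))
}

preds <- sapply(c(40, 60, 80), avg_pred_at_age, model = fit, data = sim)
preds
```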

3.5 Object Recognition in Photographs

GCV’s object recognition also tries to identify persons and their gender. In the following, we examine error rates by gender.

# creating vectors
mc_df_obj <- mc_df %>% mutate(
  objects = str_split(wiki_img_objects, ', '),
  obj_person = objects %>%
    map_chr(~ ifelse('Person' %in% ., 'recognized', 'not recognized')),
  obj_men = objects %>%
    map_chr(~ ifelse('Man' %in% ., 'recognized', 'not recognized')),
  obj_women = objects %>%
    map_chr(~ ifelse('Woman' %in% ., 'recognized', 'not recognized')),
  men = ifelse(gender == 'Male', 'in image', 'not in image'),
  women = ifelse(gender == 'Female', 'in image', 'not in image')
)
# adding vectors in new data frame
obj_vector <- c(mc_df_obj$obj_men, mc_df_obj$obj_women)
person_vector <- c(mc_df_obj$men, mc_df_obj$women)
gender_vector <- c(rep('Men', length(mc_df_obj$obj_men)),
                   rep('Women', length(mc_df_obj$obj_women)))

wiki_obj_df <- tibble(results = obj_vector,
               wiki = person_vector,
               gender = gender_vector) %>%
  mutate(combined = str_c(wiki, results, sep = ', '))

# calculating performance metrics
wiki_obj_errors <- 
  wiki_obj_df %>% group_by(wiki, gender) %>% 
  count(results) %>%
  mutate(rel = n / sum(n)) %>% ungroup() %>%
  mutate(
    gender = factor(gender, levels = c('Women', 'Men')),
    wiki = as.factor(wiki),
    results = as.factor(results),
    stat = case_when(
      wiki == 'in image' & results == 'recognized' ~ 'True Positive',
      wiki == 'not in image' &
        results == 'recognized' ~ 'False Positive',
      wiki == 'not in image' &
        results == 'not recognized' ~ 'True Negative',
      TRUE ~ 'False Negative'
    ) %>%
      factor(
        levels = c(
          'True Positive',
          'False Positive',
          'True Negative',
          'False Negative'
        )
      ),
    results_label = case_when(
      results == "recognized" ~ "Recognized\nby GCV",
      results == "not recognized" ~ "Not recognized\nby GCV"
    )
  )

wiki_obj_errors
## # A tibble: 7 x 7
##   wiki        gender results          n    rel stat        results_label        
##   <fct>       <fct>  <fct>        <int>  <dbl> <fct>       <chr>                
## 1 in image    Men    not recogni~    61 0.142  False Nega~ "Not recognized\nby ~
## 2 in image    Men    recognized     369 0.858  True Posit~ "Recognized\nby GCV" 
## 3 in image    Women  not recogni~    27 0.245  False Nega~ "Not recognized\nby ~
## 4 in image    Women  recognized      83 0.755  True Posit~ "Recognized\nby GCV" 
## 5 not in ima~ Men    not recogni~   108 0.982  True Negat~ "Not recognized\nby ~
## 6 not in ima~ Men    recognized       2 0.0182 False Posi~ "Recognized\nby GCV" 
## 7 not in ima~ Women  not recogni~   430 1      True Negat~ "Not recognized\nby ~

The following graph is included as Figure 3 in our main paper and visualizes the performance metrics by gender:

We also qualitatively compared some images to examine why some women may have been recognized while others were not. Here are some example images, which are also included in the supplementary material:

## [1] "mc_images/Maria_Cantwell.jpg"
## [1] "mc_images/Kathy_Castor.jpg"

## [1] "mc_images/Ben_Cardin.jpg"
## [1] "mc_images/Tom_Carper.jpg"

4 Biases in Image Labels of Tweets by MCs

Now, we turn to our dataset of images within tweets posted by MCs. We start with the subset for which we validated image labels with crowdworkers. In the crowd validation dataset each row represents the validation of one particular image label. Note that we only consider labels (and recognized objects) with GCV confidence scores of at least .75.

## Parsed with column specification:
## cols(
##   resp_id = col_double(),
##   label_response_number = col_double(),
##   image_number = col_double(),
##   labelresponse = col_double(),
##   men = col_double(),
##   women = col_double(),
##   children = col_double(),
##   none = col_double(),
##   label = col_character(),
##   media_url = col_character(),
##   confidence = col_double(),
##   real_label = col_double(),
##   correct = col_double()
## )
## # A tibble: 6 x 13
##   resp_id label_response_~ image_number labelresponse   men women children  none
##     <dbl>            <dbl>        <dbl>         <dbl> <dbl> <dbl>    <dbl> <dbl>
## 1     364                1            1             0     1     0        0     0
## 2     364                2            1             0     1     0        0     0
## 3     364                3            1             0     1     0        0     0
## 4     364                4            1             0     1     0        0     0
## 5     364                5            1             0     1     0        0     0
## 6     364                6            1             1     1     0        0     0
## # ... with 5 more variables: label <chr>, media_url <chr>, confidence <dbl>,
## #   real_label <dbl>, correct <dbl>

4.2 Object recognition in Twitter images

We repeat the procedure above to examine error rates for recognizing men and women in our validation sample. We first reshape the data and consider a man or woman as being present in an image if they were recognized by more than 50% of crowd workers.

## `summarise()` ungrouping output (override with `.groups` argument)
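The majority rule can be sketched as follows, aggregating the binary worker judgments per image (toy data; the columns mirror crowdvalidation_data.tsv):

```r
library(dplyr)

# one row per crowd-worker judgment; men/women are 0/1 indicators
judgments <- tibble(
  image_number = c(1, 1, 1, 2, 2, 2),
  men          = c(1, 1, 0, 0, 0, 1),
  women        = c(0, 0, 0, 1, 1, 1)
)

# a man/woman counts as present if more than 50% of workers saw one
presence <- judgments %>%
  group_by(image_number) %>%
  summarise(
    men   = ifelse(mean(men)   > .5, 'in image', 'not in image'),
    women = ifelse(mean(women) > .5, 'in image', 'not in image'),
    .groups = 'drop'
  )
```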

Next, we merge in the object recognition output from our full dataset of tweet images:

## # A tibble: 5 x 2
##   tw_img_objects tw_img_objectsconf75  
##   <chr>          <chr>                 
## 1 <NA>           <NA>                  
## 2 <NA>           <NA>                  
## 3 <NA>           <NA>                  
## 4 Man, Tie       0.81154346, 0.75388706
## 5 Man            0.7736814
crowd_obj_df <- crowd_df %>% mutate(
  objects = str_split(tw_img_objects, ', '),
  obj_person = objects %>%
    map_chr(~ ifelse('Person' %in% ., 'recognized', 'not recognized')),
  obj_men = objects %>%
    map_chr(~ ifelse('Man' %in% ., 'recognized', 'not recognized')),
  obj_women = objects %>%
    map_chr(~ ifelse('Woman' %in% ., 'recognized', 'not recognized'))
)
# adding vectors in new data frame
obj_vector <- c(crowd_obj_df$obj_men, crowd_obj_df$obj_women)
person_vector <- c(crowd_obj_df$men, crowd_obj_df$women)
gender_vector <- c(rep('Men', length(crowd_obj_df$obj_men)),
                   rep('Women', length(crowd_obj_df$obj_women)))

crowd_obj_df <- tibble(results = obj_vector,
               crowd = person_vector,
               gender = gender_vector) %>%
  mutate(combined = str_c(crowd, results, sep = ', '))

# create performance statistics
crowd_obj_errors <- 
  crowd_obj_df %>% group_by(crowd, gender) %>% 
  count(results) %>%
  mutate(rel = n / sum(n)) %>% ungroup() %>%
  mutate(
    gender = factor(gender, levels = c('Women', 'Men')),
    crowd = as.factor(crowd),
    results = as.factor(results),
    stat = case_when(
      crowd == 'in image' & results == 'recognized' ~ 'True Positive',
      crowd == 'not in image' &
        results == 'recognized' ~ 'False Positive',
      crowd == 'not in image' &
        results == 'not recognized' ~ 'True Negative',
      TRUE ~ 'False Negative'
    ) %>%
      factor(
        levels = c(
          'True Positive',
          'False Positive',
          'True Negative',
          'False Negative'
        )
      ),
    results_label = case_when(
      results == "recognized" ~ "Recognized\nby GCV",
      results == "not recognized" ~ "Not recognized\nby GCV"
    )
  )

crowd_obj_errors
## # A tibble: 8 x 7
##   crowd      gender results          n     rel stat         results_label       
##   <fct>      <fct>  <fct>        <int>   <dbl> <fct>        <chr>               
## 1 in image   Men    not recogni~  3584 0.547   False Negat~ "Not recognized\nby~
## 2 in image   Men    recognized    2966 0.453   True Positi~ "Recognized\nby GCV"
## 3 in image   Women  not recogni~  3525 0.742   False Negat~ "Not recognized\nby~
## 4 in image   Women  recognized    1223 0.258   True Positi~ "Recognized\nby GCV"
## 5 not in im~ Men    not recogni~  2596 0.977   True Negati~ "Not recognized\nby~
## 6 not in im~ Men    recognized      62 0.0233  False Posit~ "Recognized\nby GCV"
## 7 not in im~ Women  not recogni~  4417 0.990   True Negati~ "Not recognized\nby~
## 8 not in im~ Women  recognized      43 0.00964 False Posit~ "Recognized\nby GCV"

The following graph is included in the main paper as Figure 4 and visualizes the performance metrics by gender:

4.3 Image labels - key differences

Finally, we examine key differences in image labels by gender for our large dataset of images in tweets posted by MCs. We first merge demographic information from our MC dataset and keep only Democrats and Republicans.

## # A tibble: 3 x 2
##   party           n
##   <chr>       <int>
## 1 Democrat      243
## 2 Independent     2
## 3 Republican    295
## # A tibble: 4 x 3
##   gender party          n
##   <chr>  <chr>      <int>
## 1 Female Democrat   35867
## 2 Female Republican 13454
## 3 Male   Democrat   61466
## 4 Male   Republican 72110

We create the feature matrix and remove very rare terms:

## [1] 182864    659
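The replication code builds this matrix with quanteda; the trimming step itself is simple and can be sketched in base R (the threshold below is illustrative):

```r
# document-feature matrix: rows = images, columns = labels
dfm_mat <- matrix(c(1, 0, 1,
                    0, 1, 1,
                    1, 0, 1,
                    1, 0, 1), nrow = 4, byrow = TRUE,
                  dimnames = list(NULL, c('suit', 'smile', 'person')))

# keep only labels that occur at least min_freq times overall
min_freq <- 3
dfm_trimmed <- dfm_mat[, colSums(dfm_mat) >= min_freq, drop = FALSE]
dim(dfm_trimmed)
```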

Now, we repeat the procedure from above and calculate Chi² statistics and relative frequencies by gender. The following plot is included as Figure 8 in the main paper.

To analyze whether gender differences in labels are similar regardless of ideology, we again compute Chi² statistics, but this time by gender and party. For visualization purposes, we only display the top 20 labels per group, sorted by Chi² value and without relative frequencies. The plot is available in the supplementary material.

tweet_dfm_pg <- dfm_group(tweet_dfm, groups = 'gender_party')
colnames(tweet_dfm_pg) <- colnames(tweet_dfm_pg) %>% 
  str_replace_all('_', ' ') %>% str_trim()

key_fd <- textstat_keyness(tweet_dfm_pg,  target = 'Female Democrat',
                        measure = 'chi2') %>% 
  mutate(gender_party = 'Female Democrat') %>% 
  head(20)

key_fr <- textstat_keyness(tweet_dfm_pg,  target = 'Female Republican',
                        measure = 'chi2') %>% 
  mutate(gender_party = 'Female Republican') %>% 
  head(20)

key_md <- textstat_keyness(tweet_dfm_pg,  target = 'Male Democrat',
                        measure = 'chi2') %>% 
  mutate(gender_party = 'Male Democrat') %>% 
  head(20)

key_mr <- textstat_keyness(tweet_dfm_pg,  target = 'Male Republican',
                        measure = 'chi2') %>% 
  mutate(gender_party = 'Male Republican') %>% 
  head(20)

key_df <- bind_rows(key_fd, key_fr, key_md, key_mr) %>%
  mutate(feature = reorder_within(feature, chi2, gender_party),
         gender_party = case_when(
           gender_party == 'Female Democrat'   ~ "Women - Democratic Party",
           gender_party == 'Female Republican' ~ "Women - Republican Party",
           gender_party == 'Male Democrat'     ~ "Men - Democratic Party",
           gender_party == 'Male Republican'   ~ "Men - Republican Party"
         ))


key_df %>%
  ggplot(aes(feature, chi2, fill = gender_party)) +
  geom_col(show.legend = FALSE) + 
  scale_x_reordered() + 
  facet_wrap(~ gender_party, scales = 'free') + 
  coord_flip() +
  labs(y = "Chi² Value", x = NULL) +
  theme_ipsum(base_size = 13,
              axis_title_size = 10,
              caption_face = "italic",
              caption_size = 10,
              axis = FALSE,
              ticks = TRUE) +
  theme(
    legend.position = "bottom",
    legend.box.background = element_rect(colour = "grey80"),
    legend.text = element_text(size = 8),
    legend.title = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text.x = element_text(size = 10),
    axis.title.y = element_text(size = 10)
  ) +
  scale_fill_manual(
    values = c("navyblue", "darkred", "navyblue", "darkred")
  )